1 



Some aspects of simulation algorithms for dynamical fermions * 

Karl Jansen a , Beat Jegerlehner fc and Chuan Liu a 



On 
u 

Oh- 
< 



"Dcutschcs Elcktroncn-Synchroton DESY, 
Notkestr. 85, 22603 Hamburg, Germany 

^Max-Planck-Institut fiir Physik, 

Fohringer Ring 6, 80805 Munchen, Germany 

Three topics concerning fermion simulation algorithms are discussed: 1.) A performance comparison of the 
multiboson technique to simulate dynamical fermions and the Kramers equation algorithm, 2.) the question 
of reversibility in the Hybrid Monte Carlo algorithm and 3.) the implementation of Symanzik's improvement 
program for dynamical Wilson fermions. 
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1. Introduction 

Wilson's suggestion to formulate QCD on an 
euclidean space-time lattice opened the path to 
non-perturbative and first principle investigations 
of strong interactions, ft was soon realized that 
for pure Yang- Mills theory numerical simulations 
are a powerful tool to address several questions 
in this theory. In particular, numerical simula- 
tions can build a bridge between the high en- 
ergy regime, where perturbation theory is valid, 
to scales where non-perturbative effects set in. 

However, as the title of this workshop already 
suggests, simulations with dynamical fermions 
have still to be accelerated in order to obtain re- 
sults with small enough error bars in a reasonable 
amount of time. Therefore, in the last years we 
have seen a revived interest in a better under- 
standing of the existing algorithms (the old work 
horses) as we have seen new developments. 

Such a development is the multiboson tech- 
nique to simulate dynamical fermions Q and we 
will in section 2 compare its performance to the 
Kramers equation algorithm Section 3 is 

concerned with the question of lack of reversibil- 
ity in the Hybrid Monte Carlo algorithm || which 
may cause potential problems with the detailed 
balance condition in practical simulations. The 
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4th section will contain a short account of the 
implementation of Symanzik's improvement pro- 
gram for Wilson fermions. The overhead due to 
the addition of the extra Sheikholeslami-Wohlert 
(SW) term |5j in the lattice action will be dis- 
cussed. Note that in sections 2 and 3 the gauge 
group is chosen to be SU(2) while in section 4 it 
is SU(3). 

Space limitations do not allow to give here a 
full account of all the notations and background. 
For this we have to refer to the original literature 

iill- 



2. Performance comparison of the multi- 
boson technique to simulate dynamical 
fermions and the Kramers equation al- 
gorithm 

In this section, the performance of two algo- 
rithms is compared. The first one goes back to 
a suggestion of Horowitz ||] and is called the 
Kramers equation algorithm. In a free field anal- 
ysis it gives the same value of the dynamical crit- 
ical exponent, z = 1, as the Hybrid Monte Carlo 
(HMC) algorithm ||. However, whereas in the 
HMC algorithm this value of z is only reached in 
the limit of a large number of molecular dynam- 
ics steps, the Kramers equation algorithm needs 
only one step. The second algorithm is based on 
a new technique that makes use of the fact that 
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the QCD partition function can be rewritten as a 
path integral of -in principle- infinitely many bo- 
son fields with interactions given by the fermion 
matrix on the lattice. Fortunately enough, it 
turned out that for practical simulations only a 
moderate number of these boson fields is required 
to obtain a good approximation of the QCD par- 
tition function || . The tests of the Kramers equa- 
tion algorithm and the multiboson technique have 
been performed for Wilson fermions. For a de- 
tailed description and investigation of the two al- 
gorithms see H and ||. For alternative realiza- 
tions of the multiboson technique see |ll]] and for 
a review |L0]]. Let us here only emphasize the 
improvements over the "bare" algorithm that we 
have installed. Starting with the Kramers equa- 
tion algorithm, we 

• installed standard even-odd precondition- 
ing H 

• and used a particular integration scheme 
that allows for larger step sizes [[l3| . 

For the multiboson technique we 

• used again even-odd preconditioning, 

• introduced a normalization factor for the 
fermion matrix that lifts the lowest eigen- 
value while keeping the largest one below 
one 0,1 

• and made a careful investigation of the op- 
timal choice of the mixing ratio of heatbath 
and over-relaxation updates || . 

Let us mention that each of the above improve- 
ments may lead to substantial factors of acceler- 
ating the program. We also looked at the chrono- 
logical extrapolation method to find better start- 
ing vectors for the conjugate gradient algorithm 
fl4|| . However, due to the use of large step sizes 
in the Sexton- Weingarten integration scheme, we 
did not find a further acceleration of the program 
in our case. 

The algorithm and coupling parameters used 
for the tests may be found in Q. We worked 
at a rather large pion to p-mass ratio of 0.95. 
All numerical simulations have been performed 



on the Alenia Quadrics (APE) massively paral- 
lel machines. In table 1, we give the time r in 
real seconds that the algorithms need to gener- 
ate an independent configuration. The subscript 
b stands for the multiboson technique and k for 
the Kramers equation algorithm. The computa- 
tionally most expensive part in the Kramers equa- 
tion algorithm is the conjugate gradient method 
for the matrix inversion. This inversion, on the 
other hand, is dominated by fermion matrix Q |j| 
times vector 4> operations, denoted by Also 
for the bosonic algorithm similar floating point 
operations dominate the program |]^|. For this 
reason we give the autocorrelation time t[(2<I>] 
in units of Q$ in columns 3 and 4 which is a 
more machine independent measure of the per- 
formance. 

We see that within the error bars for the auto- 
correlation time, both algorithms show a compa- 
rable performance. 

3. Reversibility 

In order for the HMC algorithm to fulfill the de- 
tailed balance condition, reversibility of the equa- 
tions of motion has to hold. However, problems 
with the reversibility condition have been encoun- 
tered and in this section some aspects of these 
problems will be discussed. 

In the HMC algorithm the fields are evolved ac- 
cording to a non-linear first order stochastic dif- 
ferential equation. In the continuum, such differ- 
ential equations are candidates for describing the 
time evolution of a classical chaotic system. Of 
course, for the computer, the differential equation 
has to be discretized. Nevertheless, one might 
expect that for small enough discrete step sizes 
one is close enough to the continuum and that 
one would observe a chaotic behaviour. To test 
this proposal, let us define a quantity, ||d?7||, to 
measure the difference between two gauge field 
configurations 

IW = ^£(W- W) 2 - (!) 

Here U*(x), V^(x) are two SU(2) gauge link vari- 
ables with lattice point index x, direction /i and 
group index a. O is the lattice volume. 
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Table 1 

Performance comparison of the Kramers equation and the boson algorithms. Subscripts b and k stand 
for the multiboson technique and the Kramers equation algorithm, respectively. 



Lattice 


machine size 


T 6 [Q$] 


T k [Q<S>] 


r b [sec] 


r k [sec] 


6 3 12 


Ql [8 nodes] 


14000(800) 


21000(4000) 


298(20) 


480(100) 


8 3 12 


Ql [8 nodes] 


36000(2250) 


17000(5000) 


1781(112) 


979(300) 


16 4 


QH2 [256 nodes] 


56000(18600) 


26000(11000) 


990(330) 


540(230) 



To show that the suggested chaotic behav- 
ior is really a property of Hamilton's equations 
of motion, we proceeded in the following way. 
Given a gauge field configuration obtained in 
the course of some run, we added a small noise 
5U t _ l (x) to the gauge field variable U^(x), such 
that Vfj,(x) = U t _ l (x)+8U l _ l (x). Then, we took both 
configurations and iterated them according to the 
leapfrog integration scheme used in the HMC al- 
gorithm. We measured \\dU\\ after some number 
of steps N m d in the leapfrog integration. If the 
system is chaotic, we expect that asymptotically 
(eN md > 1) 

\\dU\\ = Ae veNmd . (2) 

In eq. (||) v is the -to be determined- Liapunov 
exponent, characterizing a chaotic system and e 
is the step size used in the program. 

Indeed, in H the asymptotic exponential grow 
of 1 1 <it/ 1 1 was verified and a positive Liapunov ex- 
ponent v » 0.75 was extracted. To see whether 
this effect also happens in a real Monte Carlo run 
with dynamical Wilson fermions using the HMC 
algorithm we did the following: When a trajec- 
tory has been integrated, we reversed the time 
and integrated back, measuring ||d?7|| using the 
initial configuration before starting the leapfrog 
integration and the final configuration at the end 
of the reversed trajectory. By fixing the step size 
e = 0.03, we obtained \\dU\\ as a function of the 
trajectory length eN m d- Indeed, we find again a 
linear behavior of Zg(||<iJ7||) with a Liapunov ex- 
ponent around 0.75. A similar observation has 
been made in the case of SU(3) gauge fields [ ^5| . 

An important consequence of this observation 
is that the necessary condition of reversibility in 
the detailed balance proof of the HMC algorithm 
may be violated in practical simulations. Round- 
ing errors appearing in the course of the simula- 



tion are blown up exponentially and destroy the 
reversibility of Hamilton's equations of motion. 
We tried to estimate, how serious this effect is for 
a real simulation. We ran two HMC programs 
simultaneously: One with 32 bit arithmetic but 
with all scalar products, dot products etc. in 
double precision (which mirrors the case of the 
APE version of our program). The second one 
running with complete 64 bit arithmetic. The 
Metropolis decision in the HMC algorithm is on 
AH, i.e. the difference of the start and the end 
Hamiltonian. We measured the difference S(AH) 
between AH of the program with 32 bit and 64 
bit precision. This was done for each step in the 
evolution 5(AH) ste p and for a complete trajec- 
tory, S(AH) t raj- We find that on a 16 4 lattice 
S(AH) tra j to be larger than 5(AH) ste p, indicat- 
ing that indeed a growth of the errors due to the 
chaotic nature is showing up. 

Repeating this test on lattices of sizes 8 4 , 
12 4 and 16 4 , we found that by a simple ex- 
trapolation in the lattice size, for a 32 4 lat- 
tice the ratio S(AH) tra j / AH rj 0.05, whereas 
5{AH) step /AH w 0.01. We believe that a dif- 
ference of AH between the 32 bit and the 64 bit 
program versions reaching this level, may lead to 
observable effects in physical quantities. 

4. Implementation of Symanzik's improve- 
ment program 

Following Symanzik ]Tq| , in order to cancel 
the 0(a) effects in physical on-shell observables, 
it is sufficient for Wilson fermions to add a 
term suggested by Sheikholeslami and Wohlert 
f§, henceforth called a SW-term. In 0, it 
has been demonstrated in the quenched approx- 
imation that the O(a) effects are substantial for 
Wilson fermions. There it was also shown that 



4 



for a complete cancellation of 0(a) effects a 
non-perturbative determination of the coefficients 
multiplying the improvement terms is necessary. 
Given the experience of this work, it is a very nat- 
ural next step to also implement the SW-term for 
dynamical fermions. The force according to the 
SW-term can be derived straightforwardly and it 
can be shown that even-odd preconditioning can 
be maintained. For a detailed description of our 
implementation and the design of possible tests 
of the code see . 

When comparing with the conventional Wilson 
QCD simulation, the timing of the version with 
the improved fermion action is the crucial piece 
of information. 
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Figure f. The CPU time for the algorithm of 
the improved action is plotted as a function of 
the number of conjugate gradient iterations per 
step (Ncg) relative to the conventional Wilson 
fermion simulation. The solid and the dashed 
lines correspond to the choice of tffl = and 
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50 in eq. (^) respectively. 



In a typical QCD simulation using molecu- 
lar dynamics algorithms, the conjugate gradient 
(CG) iterations dominate the CPU time of the 
program. Therefore, in the limit of number of 
conjugate gradient iterations Ncg going to infin- 
ity, the time of the matrix- vector multiplication is 



the most crucial part. In our version of Symanzik 
improvement, this multiplication turns out to be 
only slightly slower than the conventional Wilson 
case by about 15 percent. The force evaluation, 
in this case, adds some overhead to the program 
which does not depend on Ncg- It is convenient 
to express all times in units of the matrix-vector 
multiplication of the conventional Wilson case. 

To be specific, we propose the following formula 
for the time of the program: 

Twuson = 4 0) + (4 X) +t^)N CG , 

Tsw = t^ + it^+t^NcG ■ (3) 

In the above formula, the coefficient t w ' = 2 is, by 
definition, the number of matrix-vector multipli- 
cations needed for each CG iteration in the con- 
ventional Wilson case. The coefficient t$ repre- 
sents the cost of other operations for each CG it- 
eration (linear combinations, inner products etc.). 
Typically, this coefficient is quite small. The co- 
efficient tw^ is the overhead that does not depend 
on Ncg- The value of tw depends on the im- 
plementation of the program. However, in the 
asymptotic region where Ncg is large, the effect 
of tffl becomes irrelevant. 

The quantities t^2, tiw and tiw are the cor- 
responding coefficients for the program with the 
Sheikholeslami-Wohlert action. In this case, 
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2.3 is slightly larger than the corresponding 
value in the Wilson case. The value of tiw — tw 
remains the same and we find, for our implemen- 
tation of the SW-term on the APE computer, the 
coefficient t'fw ~ (50 + tffl) to be larger than that 
of the Wilson case. In Fig. we plot the CPU 
time of our program with improved fermions rel- 
ative to the conventional Wilson case as a func- 
tion of the number of CG-iterations per molecu- 
lar dynamics step. The solid and dashed curves 
in the figure correspond to the choice of tw = 
and fc^ = 50 respectively, which we consider to 
be two rather extreme cases. For typical simu- 
lations, the curve should lie between these two. 
In any case, we see that, for the most interest- 
ing physical situation in which Ncg ~ 200 or 
more, the simulation with improved fermions is 
only about 20% slower, quite independent of the 
value of t 
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5. Conclusions 

We presented a performance test of the multi- 
boson technique to simulate dynamical fermions 
and the Kramers equation algorithm. We find 
that these two quite different simulation tech- 
niques for fermions give a comparable perfor- 
mance. Approaches to make the multiboson 
technique exact by adding a Metropolis step are 
promising: The number of boson fields can be 
chosen to be small while keeping large acceptance 
rates [[llj . A detailed performance comparison of 
a version of an exact multiboson program and the 
Kramers equation or HMC algorithm is, however, 
not yet available. 

Possible problems with the lack of reversibility 
in the discretized Hamilton's equations of motion 
have been discussed in section 3. A rough esti- 
mate leads to the conclusion that on lattices with 
size 32 4 on machines with 32 bit arithmetic the 
lack of reversibility can give serious problems. For 
the Kramers equation algorithm this problem is 
substantially reduced, since one only would need 
one molecular dynamics step. 

Coming back to the title of the workshop, 
did we now accelerate the fermion algorithms? 
The answer is yes and comes from a quite un- 
expected direction. The SW-term as required 
by Symanzik's improvement program can be im- 
plemented straightforwardly for Wilson fermions. 
Fig. 1 demonstrates that the slow down of the 
program is only about 20%. On the other hand, 
by determining the coefficients multiplying the 
improvement terms non-perturbatively, all O(a) 
effects in physical on-shell observables can be 
eliminated which results in a much faster ap- 
proach to the continuum limit, in which we are 
most interested in. 
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